{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Lab 10 - Sampling distributions, Part 2\n", "\n", "For this lab, we will use a list of the top 1000 movies on [IMDb](https://www.imdb.com) compiled by [Kevin Markham](https://www.dataschool.io/about/) several years ago. \n", "\n", "The data CSV file is on GitHub here: [imdb_1000.csv](https://github.com/justmarkham/DAT8/blob/master/data/imdb_1000.csv). To download, right-click on Raw and save the CSV file.\n", "\n", "### Sampling and empirical distributions of statistics\n", "\n", "A *sampling distribution* of a statistic (mean, median, variance, etc.) is the distribution of that statistic over all possible samples of the same size. Since it's impractical to draw every possible sample, we will compute the statistic on a number of random samples, which gives us the *empirical distribution* of the statistic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As usual, we will import the matplotlib and pandas packages, and set plots to appear in the Jupyter notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the CSV file into a dataframe called `movies`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the dataframe was created properly by displaying it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the columns. Which columns contain quantitative data? These columns are the only ones we can compute the mean of. 
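
One way to check which columns are quantitative is to ask pandas for the numeric ones with `select_dtypes`. The sketch below runs on a small made-up dataframe named `toy` (an assumption for illustration; the `title` column and all the values are invented), so it works without the CSV file:

```python
import pandas as pd

# Made-up stand-in for the movies dataframe; values and the title column are invented.
toy = pd.DataFrame({
    'star_rating': [9.3, 9.2, 9.1],
    'title': ['A', 'B', 'C'],
    'duration': [142, 175, 200],
})

# Keep only the columns whose dtype is numeric, then list their names.
numeric_columns = toy.select_dtypes(include='number').columns.tolist()
print(numeric_columns)
```

The same call on `movies` should list the columns for which `.mean()` makes sense.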
\n", "\n", "We are going to take 10 random samples of size 50 of our dataframe `movies`, take the mean star rating for each sample, and plot a histogram of these means. First we create an empty list to store the means. Type `means = []` below and run the code." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, can you write the code to take a sample of size 50 from `movies` and compute the mean of the `star_rating` column in the sample? Look back at lab 7 if necessary." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    sample = movies.sample(50)\n", "    sample[\"star_rating\"].mean()\n", "
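
If you want the same sample every time you rerun a cell, `sample()` accepts a `random_state` argument. The sketch below uses a made-up Series named `toy_ratings` (an assumption for illustration, not the IMDb data) so that it runs on its own:

```python
import pandas as pd

# Made-up stand-in for the star_rating column; not the real IMDb ratings.
toy_ratings = pd.Series(range(1, 1001)) / 100

# Fixing random_state makes the random draw repeatable.
first_mean = toy_ratings.sample(50, random_state=1).mean()
second_mean = toy_ratings.sample(50, random_state=1).mean()

# Both draws pick the same 50 rows, so the two means are identical.
print(first_mean == second_mean)
```

Without `random_state`, each call draws a different sample, which is what we want when building an empirical distribution.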
\n", "\n", "Great! Now try putting your code inside of a loop, so that it repeats 10 times." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    for i in range(10):\n", "        sample = movies.sample(50)\n", "        sample[\"star_rating\"].mean()\n", "
\n", "\n", "Notice that the loop displays nothing: Jupyter only shows the value of the last expression in a cell, not of expressions evaluated inside a loop. To print each mean, change the last line of the loop to use the `print()` function, like this: `print(sample[\"star_rating\"].mean())`. Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    for i in range(10):\n", "        sample = movies.sample(50)\n", "        print(sample[\"star_rating\"].mean())\n", "
\n", "\n", "Now, instead of printing the means, we want to save them to our list `means`. We do this with the list method `append()`: `means.append(sample[\"star_rating\"].mean())`\n", "\n", "Copy your loop below and change it to add the means to the list instead of printing them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    for i in range(10):\n", "        sample = movies.sample(50)\n", "        means.append(sample[\"star_rating\"].mean())\n", "
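
One thing to watch for: because `means = []` lives in its own cell, rerunning the loop cell appends 10 more values to the same list each time. A minimal sketch of one way to avoid this is to reset the list in the same cell as the loop. It uses a made-up Series `toy_ratings` in place of the `star_rating` column (an assumption so the example runs on its own):

```python
import pandas as pd

# Made-up stand-in for the star_rating column; not the real IMDb ratings.
toy_ratings = pd.Series(range(1, 1001)) / 100

means = []  # reset here so rerunning the cell starts from an empty list
for i in range(10):
    sample = toy_ratings.sample(50)
    means.append(sample.mean())

# The list holds exactly one mean per pass through the loop.
print(len(means))
```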
\n", "\n", "You can see the means by typing `means` below, which will display the contents of the list." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make a histogram of these means, we first have to convert the list into a pandas Series, and then we can make the histogram. Type `pd.Series(means).hist()` below and run the code." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try this again, but with 100 samples, so that the histogram shows the shape of the distribution more clearly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    means100 = []\n", "    for i in range(100):\n", "        sample = movies.sample(50)\n", "        means100.append(sample[\"star_rating\"].mean())\n", "    pd.Series(means100).hist()\n", "
\n", "\n", "What do you notice?\n", "\n", "Now let's take 100 samples of size 50 and take the variance of each sample instead of the mean. What does the histogram (empirical distribution) look like? Does it have the same shape as for the mean?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    variances = []\n", "    for i in range(100):\n", "        sample = movies.sample(50)\n", "        variances.append(sample[\"star_rating\"].var())\n", "    pd.Series(variances).hist()\n", "
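
To compare the two empirical distributions with numbers as well as by eye, one option is to summarize each by its center, spread, and skewness. The sketch below uses a synthetic, normally distributed population (an assumption; the real star ratings need not look like this), and the names `population` and `summary` are illustrative only:

```python
import numpy as np
import pandas as pd

# Synthetic population of 1000 values; an assumption, not the IMDb ratings.
rng = np.random.default_rng(0)
population = pd.Series(rng.normal(7.9, 0.3, size=1000))

means = []
variances = []
for i in range(100):
    # Use the loop counter as the seed so the run is repeatable.
    sample = population.sample(50, random_state=i)
    means.append(sample.mean())
    variances.append(sample.var())

# One column per statistic, one row per summary measure.
summary = pd.DataFrame({'mean': means, 'variance': variances}).agg(['mean', 'std', 'skew'])
print(summary)
```

A skew value near zero suggests a roughly symmetric histogram, while a clearly positive value suggests a longer right tail.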
\n", "\n", "#### Challenges:\n", "- What happens if you leave the number of samples the same, but increase the size of the samples?\n", "- What does the sampling distribution of the median star rating look like?\n", "- What do the sampling distributions of the mean, median, and variance of the duration look like? How do they compare to the sampling distributions of the mean, median, and variance of the star ratings?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.6" } }, "nbformat": 4, "nbformat_minor": 2 }